Data Mining
With SSAS, a much more robust selection of data mining capabilities is available. Data mining
is the process of uncovering previously unknown
characteristics or distributions of data. Data mining can be extremely
useful in OLAP database design because the patterns or values it reveals might define
hierarchy levels or dimensions that were not previously known. As you create dimensions, you can even choose a data mining model as the basis for a dimension.
Basically, a data
mining model is a reference structure that represents the grouping and
predictive analysis of relational or multidimensional data. It is
composed of the rules, patterns, and other statistical information derived from the
data it analyzes. The individual units of data being analyzed are called cases. A case set
is simply a means of viewing the physical data. Different case sets
can be constructed from the same physical data; in other words, a case set defines the cases
from a particular point of view. If the algorithm you are using
supports it, you can use mining models to make predictions based
on these findings.
Another aspect of a data mining
model is its use of training data. Training determines the relative
importance of each attribute in a data mining model. It does this by
recursively partitioning data into smaller groups until no more
splitting can occur. During this partitioning process, information is
gathered from the attributes used to determine each split, and a probability
can be established for each resulting categorization of the data.
These probabilities can then be used to draw conclusions about other
data. The training data, in the form of
dimensions, levels, member properties, and measures, is used to process
the OLAP data mining model and to further define the data mining column
structure for the case set.
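The recursive-partitioning idea described above can be sketched in a few lines: split the cases on attribute values until no further split is possible, and record a probability for each resulting group. This is a hypothetical, minimal illustration of the concept only, not SSAS code; the attribute and case values are invented for the example.

```python
# Toy sketch of recursive partitioning with per-group probabilities.
# SSAS performs this kind of analysis internally during model training;
# all names and data here are hypothetical.
from collections import Counter

def partition(cases, attributes, target):
    """Recursively split cases and report P(target value) per group."""
    counts = Counter(c[target] for c in cases)
    total = sum(counts.values())
    # Stop when the group is pure or no attributes remain to split on.
    if len(counts) == 1 or not attributes:
        return {value: n / total for value, n in counts.items()}
    attr, rest = attributes[0], attributes[1:]
    return {
        f"{attr}={value}": partition(
            [c for c in cases if c[attr] == value], rest, target)
        for value in sorted({c[attr] for c in cases})
    }

cases = [  # hypothetical training cases
    {"region": "East", "season": "Summer", "sold": "yes"},
    {"region": "East", "season": "Winter", "sold": "no"},
    {"region": "West", "season": "Summer", "sold": "yes"},
    {"region": "West", "season": "Winter", "sold": "yes"},
]
print(partition(cases, ["region"], "sold"))
# {'region=East': {'yes': 0.5, 'no': 0.5}, 'region=West': {'yes': 1.0}}
```

Each leaf of the result carries the probabilities that the real training process would use to make predictions about new data.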
In SSAS, Microsoft provides several data mining algorithms (or techniques):
Association Rules—
This algorithm builds rules that describe which items are most likely
to appear together in a transaction. The rules can be used to predict the
presence of one item based on the presence of other items that have appeared with it in
the same type of transaction before.
Clustering—
This algorithm uses iterative techniques to group records from a
dataset into clusters that share similar characteristics. It is one
of the most broadly useful algorithms because it can find general groupings in
data.
Sequence Clustering— This
algorithm is a combination of sequence analysis and clustering, and it
identifies clusters of similarly ordered events in a sequence. The
clusters can be used to predict the likely ordering of events in a
sequence, based on known characteristics.
Decision Trees—
This classification algorithm works well for predictive modeling. It
supports the prediction of both discrete and continuous attributes.
Linear Regression—
This regression algorithm works well for regression modeling. It is a
configuration variation of the Decision Trees algorithm, obtained by
disabling splits. (The whole regression formula is built in a single
root node.) The algorithm supports the prediction of continuous
attributes.
Logistic Regression—
This regression algorithm works well for regression modeling. It is a
configuration variation of the Neural Network algorithm, obtained by
eliminating the hidden layer. This algorithm supports the prediction of
both discrete and continuous attributes.
Naïve Bayes—
This classification algorithm is quick to build, and it works well for
predictive modeling. It supports only discrete attributes, and it
considers all the input attributes to be independent, given the
predictable attribute.
Neural Network—
This algorithm uses a gradient method to optimize parameters of
multilayer networks to predict multiple attributes. It can be used for
classification of discrete attributes as well as regression of
continuous attributes.
Time Series—
This algorithm uses a linear regression decision tree approach to
analyze time-related data, such as monthly sales data or yearly
profits. The patterns it discovers can be used to predict values for
future time steps across a time horizon.
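As a rough illustration of the clustering technique described above, here is a minimal K-means sketch: assign each record to its nearest centroid, move each centroid to the mean of its group, and repeat until stable. This is only a conceptual toy, not what SSAS runs internally (Microsoft Clustering offers both EM and K-means methods), and the SKU sales figures are invented for the example.

```python
# Minimal K-means sketch of the clustering idea. Hypothetical data;
# illustrates the general technique only, not the SSAS implementation.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: (p - centroids[i]) ** 2)
            groups[nearest].append(p)
        # Move each centroid to the mean of its assigned points.
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids, groups

# Hypothetical monthly unit sales for several SKUs: two natural "hot spots".
sales = [10, 12, 11, 13, 95, 101, 98, 104]
centroids, groups = kmeans(sales, centroids=[0.0, 50.0])
print(centroids)  # [11.5, 99.5]
print(groups)     # [[10, 12, 11, 13], [95, 101, 98, 104]]
```

The two recovered centroids correspond to the two natural groupings in the data, which is exactly the kind of "hot spot" the Clustering algorithm surfaces in a cube.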
To create an OLAP data
mining model, SSAS needs an existing source OLAP cube (or an
existing relational database/data warehouse), a particular data mining
technique/algorithm, a case dimension and level, a predicted entity, and,
optionally, training data. The source OLAP cube provides the
information needed to create a case set for the data mining model. You
then select the data mining technique (Decision Trees, Clustering, or
one of the others). SSAS uses the dimension and level that you choose to
establish key columns for the case sets. The case dimension and level
give the data mining model a particular orientation into the cube
for creating a case set. The predicted entity can be a measure
from the source OLAP cube, a member property of the case dimension and
level, or any member of another dimension in the source OLAP cube.
Note
The Data Mining Wizard can also
create a new dimension for a source cube, and it enables users to query the
data mining model data just as they would query OLAP data (by using
the DMX extensions to SQL or the mining structure browsers).
In Visual Studio, you simply initiate the Data Mining Wizard by right-clicking the Mining Structures
entry in the Solution Explorer. You cannot create new mining structures
from SSMS. When you are past the wizard’s splash screen, you have the
option of creating your mining model from either an existing relational
database (or data warehouse) or an existing OLAP cube (as shown in Figure 53).
You want to define a data
mining model that can shed light on product (SKU) sales characteristics
and that will be based on the data and structure you have created so
far in your Comp Sales Unleashed cube. For this example, you choose
the existing cube method and use the OLAP cube you already have.
You must now select the data
mining technique you think will help you find value in your cube’s
data. Clustering is probably the best one to start with because it
finds natural groupings of data in a multidimensional space. It is
useful when you want to see general groupings in your data, such as hot
spots. You are trying to find just such things in product sales
(for example, things that sell together or belong together). Figure 54 shows the data mining technique Microsoft Clustering being selected.
Now you have to identify the source cube dimension to use to build the mining structure. As you can see in Figure 55, you choose Product Dimension to fit the mining intentions stated earlier.
You then select the case key or point of view for the mining analysis. Figure 56 illustrates the case to be based on the product dimension and at the SKU level (that is, the individual product level).
You now specify the attributes and measures as case-level columns of the new mining structure. Figure 57
shows the possible selections. You can simply choose all the data
measures for this mining structure. Then you click the Next button.
As you can see in Figure 58,
the next few wizard dialogs allow you to specify the mining structure
columns’ content and data types (use the defaults that were detected
for most items unless we specifically describe something different),
identify a filtered slice to use for the model training (you don’t need
one now because you want the whole cube), and finally identify
the number of cases to be reserved for model testing (set the percentage
of data for testing to about 33%).
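The "percentage of data for testing" step amounts to a simple holdout split: roughly a third of the cases are set aside for model testing, and the rest are used for training. A minimal sketch of the idea, with hypothetical case names:

```python
# Holdout-split sketch: reserve about a third of the cases for testing.
# Case names and the seed are hypothetical; SSAS handles this internally.
import random

def holdout_split(cases, test_fraction=0.33, seed=53):
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)     # reproducible shuffle
    cut = int(len(shuffled) * test_fraction)  # size of the test set
    return shuffled[cut:], shuffled[:cut]     # (training, testing)

cases = [f"SKU-{n:03d}" for n in range(100)]
train, test = holdout_split(cases)
print(len(train), len(test))  # 67 33
```

Holding the test cases out of training is what lets you later measure how well the mining model predicts data it has never seen.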
The mining model is now specified and must be named and processed. Figure 59 shows what you have named the mining structure (Product Dimension MS) and the mining model name itself (Product Dimension MM).
Also, you select the Allow Drill Through option so you can look further
into the data in the mining model after it is processed. Then you click
the Finish button.
When the Data Mining
Wizard completes, the mining structure viewer pops up, showing your
mining structure’s case-level column specifications (on the center left)
and their correlation to your cube (see Figure 60).
You must now process the
mining structure to see what you come up with. You do this by selecting
the Mining Model toolbar option and then the Process option. The
usual Process dialog appears, and you choose to run it
(processing the mining structure). After the mining structure processing
completes, a quick click on the Cluster Diagram tab shows the results
of the clustering analysis (see Figure 61).
Notice that because you chose to allow drill through, you can simply
right-click any of the identified clusters and choose Drill Through to see
the data that is part of the cluster. This viewer clearly
shows that there is some clustering of SKU values that might indicate
products that sell together or belong together.
If you click the Cluster Profiles tab of this viewer, you see the data value profile characteristics that were processed (see Figure 62).
Figure 63
shows the clusters of data values of each data measure in the data
mining model. This characteristic information gives you a good idea of
what the actual data values are and how they cluster together.
Finally, you can see the
cluster node contents at the detail level by changing the mining model
viewer type to Microsoft Generic Content Tree Viewer, just
below the Mining Model Viewer tab at the top. Figure 64 shows the detail contents of each model node and its technical specification in a report format.
If you want, you can now
build new cube dimensions that can help you do predictive modeling
based on the findings of the data mining structures you just processed.
In this way, you could predict sales units of one SKU and the number of
naturally clustered SKUs quite easily (based on the past data mining
analysis). This type of predictive modeling is very powerful.